refinement process
Leveraging Environment Interaction for Automated PDDL Translation and Planning with Large Language Models
Large Language Models (LLMs) have shown remarkable performance in various natural language tasks, but they often struggle with planning problems that require structured reasoning. To address this limitation, the conversion of planning problems into the Planning Domain Definition Language (PDDL) has been proposed as a potential solution, enabling the use of automated planners. However, generating accurate PDDL files typically demands human input or correction, which can be time-consuming and costly. In this paper, we propose a novel approach that leverages LLMs and environment feedback to automatically generate PDDL domain and problem description files without the need for human intervention. Our method introduces an iterative refinement process that generates multiple problem PDDL candidates and progressively refines the domain PDDL based on feedback obtained from interacting with the environment. To guide the refinement process, we develop an Exploration Walk (EW) metric, which provides rich feedback signals for LLMs to update the PDDL file. We evaluate our approach on 10 PDDL environments. We achieve an average task solve rate of 66% compared to a 29% solve rate by GPT-4's intrinsic planning with chain-of-thought prompting. Our work enables the automated modeling of planning environments using LLMs and environment feedback, eliminating the need for human intervention in the PDDL translation process and paving the way for more reliable LLM agents in challenging problems.
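The refine-from-feedback loop described in this abstract can be illustrated with a minimal, generic sketch. It is not the paper's method: the abstract does not define the EW metric, so `score` below is only a hypothetical stand-in for an environment feedback signal, `propose` is a hypothetical stand-in for an LLM generating candidates, and the integer "domain" is a toy replacement for a PDDL file.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def refine(
    initial: T,
    propose: Callable[[T, float], List[T]],  # hypothetical: LLM proposing candidates
    score: Callable[[T], float],             # hypothetical: environment feedback signal
    rounds: int = 5,
) -> Tuple[T, float]:
    """Keep the best-scoring candidate across a fixed number of refinement rounds."""
    best, best_score = initial, score(initial)
    for _ in range(rounds):
        # Candidates are generated from the best version found so far.
        for candidate in propose(best, best_score):
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
    return best, best_score


# Toy stand-ins: the "domain" is an integer and feedback rewards closeness to 7.
def toy_score(x: int) -> float:
    return -abs(x - 7)


def toy_propose(x: int, _feedback: float) -> List[int]:
    return [x - 1, x + 1]


result = refine(0, toy_propose, toy_score, rounds=10)
print(result)
```

In a real system the candidates would be PDDL text and the score would come from executing plans in the environment; the loop structure is the only part this sketch is meant to convey.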
DisCo-Layout: Disentangling and Coordinating Semantic and Physical Refinement in a Multi-Agent Framework for 3D Indoor Layout Synthesis
Gao, Jialin, Zhou, Donghao, Liang, Mingjian, Liu, Lihao, Fu, Chi-Wing, Hu, Xiaowei, Heng, Pheng-Ann
3D indoor layout synthesis is crucial for creating virtual environments. Traditional methods struggle with generalization due to fixed datasets. While recent LLM and VLM-based approaches offer improved semantic richness, they often lack robust and flexible refinement, resulting in suboptimal layouts. We develop DisCo-Layout, a novel framework that disentangles and coordinates physical and semantic refinement. For independent refinement, our Semantic Refinement Tool (SRT) corrects abstract object relationships, while the Physical Refinement Tool (PRT) resolves concrete spatial issues via a grid-matching algorithm. For collaborative refinement, a multi-agent framework intelligently orchestrates these tools, featuring a planner for placement rules, a designer for initial layouts, and an evaluator for assessment. Experiments demonstrate DisCo-Layout's state-of-the-art performance, generating realistic, coherent, and generalizable 3D indoor layouts. Our code will be publicly available.
Small sample-based adaptive text classification through iterative and contrastive description refinement
Rajeev, Amrit, Avadhanam, Udayaadithya, Tulapurkar, Harshula, Sundar, SaiBarath
Zero-shot text classification remains a difficult task in domains with evolving knowledge and ambiguous category boundaries, such as ticketing systems. Large language models (LLMs) often struggle to generalize in these scenarios due to limited topic separability, while few-shot methods are constrained by insufficient data diversity. We propose a classification framework that combines iterative topic refinement, contrastive prompting, and active learning. Starting with a small set of labeled samples, the model generates initial topic labels. Misclassified or ambiguous samples are then used in an iterative contrastive prompting process to refine category distinctions by explicitly teaching the model to differentiate between closely related classes. The framework features a human-in-the-loop component, allowing users to introduce or revise category definitions in natural language. This enables seamless integration of new, unseen categories without retraining, making the system well-suited for real-world, dynamic environments. The evaluations on AGNews and DBpedia demonstrate strong performance: 91% accuracy on AGNews (3 seen, 1 unseen class) and 84% on DBpedia (8 seen, 1 unseen), with minimal accuracy shift after introducing unseen classes (82% and 87%, respectively). The results highlight the effectiveness of prompt-based semantic reasoning for fine-grained classification with limited supervision.
Generalized Derangetropy Functionals for Modeling Cyclical Information Flow
This paper introduces a framework for modeling cyclical and feedback-driven information flow through a generalized family of entropy-modulated transformations called derangetropy functionals. Unlike scalar and static entropy measures such as Shannon entropy, these functionals act directly on probability densities and provide a topographical representation of information structure across the support of the distribution. The framework captures periodic and self-referential aspects of information distribution and encodes them through functional operators governed by nonlinear differential equations. When applied recursively, these operators induce a spectral diffusion process governed by the heat equation, leading to convergence toward a Gaussian characteristic function. This convergence theorem provides a unified analytical foundation for describing the long-term dynamics of information under cyclic modulation. The proposed framework offers new tools for analyzing the temporal evolution of information in systems characterized by periodic structure, stochastic feedback, and delayed interaction, with applications in artificial neural networks, communication theory, and non-equilibrium statistical mechanics.
RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction
Wang, Yuchi, Cai, Yishuo, Ren, Shuhuai, Yang, Sihan, Yao, Linli, Liu, Yuanxin, Zhang, Yuanxing, Wan, Pengfei, Sun, Xu
Image recaptioning is widely used to generate training datasets with enhanced quality for various multimodal tasks. Existing recaptioning methods typically rely on powerful multimodal large language models (MLLMs) to enhance textual descriptions, but often suffer from inaccuracies due to hallucinations and incompleteness caused by missing fine-grained details. To address these limitations, we propose RICO, a novel framework that refines captions through visual reconstruction. Specifically, we leverage a text-to-image model to reconstruct a caption into a reference image, and prompt an MLLM to identify discrepancies between the original and reconstructed images to refine the caption. This process is performed iteratively, progressively promoting the generation of more faithful and comprehensive descriptions. To mitigate the additional computational cost induced by the iterative process, we introduce RICO-Flash, which learns to generate captions like RICO using DPO. Extensive experiments demonstrate that our approach significantly improves caption accuracy and completeness, outperforming most baselines by approximately 10% on both CapsBench and CompreCap. Code released at https://github.com/wangyuchi369/RICO.
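The reconstruct, compare, and revise cycle described in this abstract can be sketched as a small loop. This is an illustrative toy under stated assumptions, not the released RICO code: `reconstruct`, `find_discrepancies`, and `revise` are hypothetical callables standing in for the text-to-image model and the MLLM, and images and captions are modeled as sets of object names so the example runs without any model.

```python
def refine_caption(image, caption, reconstruct, find_discrepancies, revise, max_iters=3):
    """Iteratively revise a caption until its reconstruction matches the image."""
    for _ in range(max_iters):
        reconstructed = reconstruct(caption)            # stand-in for text-to-image
        issues = find_discrepancies(image, reconstructed)  # stand-in for MLLM comparison
        if not issues:
            break  # reconstruction matches the original; nothing left to fix
        caption = revise(caption, issues)               # stand-in for MLLM revision
    return caption


# Toy stand-ins: an "image" and a "caption" are both sets of object names.
def toy_reconstruct(caption):
    return set(caption)


def toy_discrepancies(image, reconstructed):
    missing = [("add", obj) for obj in sorted(image - reconstructed)]
    extra = [("remove", obj) for obj in sorted(reconstructed - image)]
    return missing + extra


def toy_revise(caption, issues):
    action, obj = issues[0]  # fix one discrepancy per iteration
    return caption | {obj} if action == "add" else caption - {obj}


image = {"dog", "ball", "grass"}
final = refine_caption(image, {"dog", "cat"}, toy_reconstruct,
                       toy_discrepancies, toy_revise, max_iters=5)
print(sorted(final))
```

The iteration cap mirrors the cost concern raised in the abstract: each round requires a reconstruction and a comparison, so a bounded loop trades completeness for compute.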
Gaze-based Task Decomposition for Robot Manipulation in Imitation Learning
Takizawa, Ryo, Ohmura, Yoshiyuki, Kuniyoshi, Yasuo
In imitation learning for robotic manipulation, decomposing object manipulation tasks into multiple sub-tasks is essential. This decomposition enables the reuse of learned skills in varying contexts and the combination of acquired skills to perform novel tasks, rather than merely replicating demonstrated motions. Gaze plays a critical role in human object manipulation, where it is strongly correlated with hand movements. We hypothesize that an imitating agent's gaze control, fixating on specific landmarks and transitioning between them, simultaneously segments demonstrated manipulations into sub-tasks. In this study, we propose a simple yet robust task decomposition method based on gaze transitions. The method leverages teleoperation, a common modality in robotic manipulation for collecting demonstrations, in which a human operator's gaze is measured and used for task decomposition as a substitute for an imitating agent's gaze. Notably, our method achieves consistent task decomposition across all demonstrations for each task, which is desirable in contexts such as machine learning. We applied this method to demonstrations of various tasks and evaluated the characteristics and consistency of the resulting sub-tasks. Furthermore, through extensive testing across a wide range of hyperparameter variations, we demonstrated that the proposed method possesses the robustness necessary for application to different robotic systems.
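The idea of segmenting a demonstration at gaze transitions can be illustrated with a minimal sketch. It assumes, beyond what the abstract states, that each timestep has already been labeled with the landmark being fixated; the function name and the landmark labels are hypothetical.

```python
from itertools import groupby


def segment_by_gaze(landmarks):
    """Split a per-timestep landmark sequence into (landmark, start, end) segments.

    `end` is exclusive. A new segment begins whenever the fixated landmark changes.
    """
    segments = []
    start = 0
    for landmark, run in groupby(landmarks):
        length = sum(1 for _ in run)
        segments.append((landmark, start, start + length))
        start += length
    return segments


# Hypothetical demonstration: gaze moves from the cup to the plate and back.
demo = ["cup", "cup", "cup", "plate", "plate", "cup"]
print(segment_by_gaze(demo))
# → [('cup', 0, 3), ('plate', 3, 5), ('cup', 5, 6)]
```

A practical system would first have to turn raw gaze coordinates into stable fixations, which this sketch leaves out.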
On Synthetic Texture Datasets: Challenges, Creation, and Curation
Hoak, Blaine, McDaniel, Patrick
The influence of textures on machine learning models has been an ongoing investigation, specifically in texture bias/learning, interpretability, and robustness. However, due to the lack of large and diverse texture data available, the findings in these works have been limited, as more comprehensive evaluations have not been feasible. Image generative models are able to provide data creation at scale, but utilizing these models for texture synthesis has been unexplored and poses additional challenges both in creating accurate texture images and validating those images. In this work, we introduce an extensible methodology and corresponding new dataset for generating high-quality, diverse texture images capable of supporting a broad set of texture-based tasks. Our pipeline consists of: (1) developing prompts from a range of descriptors to serve as input to text-to-image models, (2) adopting and adapting Stable Diffusion pipelines to generate and filter the corresponding images, and (3) further filtering down to the highest quality images. Through this, we create the Prompted Textures Dataset (PTD), a dataset of 362,880 texture images that span 56 textures. During the process of generating images, we find that NSFW safety filters in image generation pipelines are highly sensitive to texture (and flag up to 60% of our texture images), uncovering a potential bias in these models and presenting unique challenges when working with texture data. Through both standard metrics and a human evaluation, we find that our dataset is high quality and diverse.
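Step (1) of the pipeline, building prompts from descriptors, can be sketched as a simple cross product. The template wording, texture names, and descriptors below are hypothetical placeholders, not the ones used to build PTD. The final lines only check arithmetic on the stated totals: 362,880 images over 56 textures is 6,480 per texture if the images are evenly distributed, which the abstract does not confirm.

```python
from itertools import product


def build_prompts(textures, descriptors, template="a {descriptor} {texture} texture"):
    """Combine every texture with every descriptor into a text-to-image prompt."""
    return [
        template.format(descriptor=descriptor, texture=texture)
        for texture, descriptor in product(textures, descriptors)
    ]


# Hypothetical inputs for illustration only.
textures = ["striped", "dotted"]
descriptors = ["red", "glossy", "rough"]
prompts = build_prompts(textures, descriptors)
print(len(prompts), prompts[0])

# Arithmetic on the totals stated in the abstract (even split is an assumption).
images_per_texture, remainder = divmod(362_880, 56)
print(images_per_texture, remainder)
```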